## Additional implementation details
To stabilize learning, we also provide a 'softmax' strategy for computing the target Q-value.

It is computed as follows:
$$
y_t^i = r_t^i + \gamma \sum_{a' \in \mathcal{A}} \frac{\exp(\hat{Q}_\theta(s_{t+1}^i, a', D^i_{t}) / \tau)}{\sum_{b \in \mathcal{A}} \exp(\hat{Q}_\theta(s^i_{t+1}, b, D^i_{t}) / \tau)} \hat{Q}_\theta(s^i_{t+1}, a', D_{t}^i)
$$
where $\tau$ is the temperature parameter, and is set to 1.
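
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the function name `softmax_target` and its arguments are hypothetical, and the Q-values are passed in as a plain list rather than being produced by $\hat{Q}_\theta$.

```python
import math

def softmax_target(reward, next_q_values, gamma=0.99, tau=1.0):
    """Softmax-weighted target: r + gamma * sum_a softmax(Q / tau)[a] * Q[a].

    `next_q_values` stands in for the target-network Q-values of every
    action in the next state (hypothetical input format).
    """
    # Subtracting the maximum before exponentiating avoids overflow
    # and leaves the softmax weights unchanged.
    q_max = max(next_q_values)
    exps = [math.exp((q - q_max) / tau) for q in next_q_values]
    normalizer = sum(exps)
    expected_q = sum(e / normalizer * q for e, q in zip(exps, next_q_values))
    return reward + gamma * expected_q
```

As $\tau \to 0$ the weights concentrate on the largest Q-value, so the target approaches the usual max-based target; larger $\tau$ moves it toward the plain average over actions.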

## Linear stochastic bandit experiments
### Data Generation
You should first generate data using the following command:
```
python generate_data.py --env linear_bandit --algos random --num_envs 100000 --num_per_task 1
```
Arguments:
- env: the environment name, currently only linear_bandit is supported
- algos: the algorithm name, currently only linucb and random are supported
- num_envs: the number of different environments to generate data for
- num_per_task: the number of trajectories to generate for each environment

The data will be saved in the `data` folder.
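
For intuition, the sketch below shows what one trajectory collected by a uniformly random policy in a linear stochastic bandit could look like. It is an assumption-laden illustration, not the logic of `generate_data.py`: the function name, the number of arms, the Gaussian sampling of the parameter and arm features, and the `(action, reward)` output format are all hypothetical.

```python
import random

def generate_random_trajectory(dim=5, num_arms=10, horizon=200,
                               noise_std=1.0, seed=0):
    """Roll out a uniformly random policy in one linear bandit environment."""
    rng = random.Random(seed)
    # Hidden parameter vector that defines this environment (assumed Gaussian).
    theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # One fixed feature vector per arm (assumed Gaussian).
    arms = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_arms)]
    trajectory = []
    for _ in range(horizon):
        action = rng.randrange(num_arms)  # random behavior policy
        mean_reward = sum(t * x for t, x in zip(theta, arms[action]))
        reward = mean_reward + rng.gauss(0.0, noise_std)  # noisy linear reward
        trajectory.append((action, reward))
    return theta, arms, trajectory
```

With `--num_envs 100000 --num_per_task 1`, the analogue here would be calling this function once for each of 100000 different seeds.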

### Training models
You can train models using the following command:
```
python pretrain_v2.py \
    --env linear_bandit \
    --dim 5 \
    --num_envs 100000 \
    --num_per_task 1 \
    --source random \
    --lr 5e-6 \
    --min_lr 1e-7 \
    --num_epochs 250 \
    --gamma 0.99 \
    --double \
    --H 200 \
    --gpu 0 \
    --use_wandb \
    --n_embd 256 \
    --weight_decay 0.001 \
    --batch_size 128 \
    --n_layer 8 \
    --Q \
    --soft_max
```
Arguments:
- Q: whether to use Q-learning during training
- double: whether to use double DQN during training
- H: the horizon, i.e. the number of steps in each trajectory
- gpu: the GPU id
- use_wandb: whether we use wandb to record the training process
- n_embd: the number of embedding units in the neural network
- weight_decay: the weight decay coefficient used during training
- batch_size: the training batch size
- n_layer: the number of layers in the neural network
- soft_max: whether we use softmax strategy when computing the target Q-value
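
As a reminder of what the `double` option refers to, here is a minimal sketch of the standard double DQN target: the online network selects the next action and the target network evaluates it. The function and argument names are hypothetical, and how the training script combines `--double` with `--soft_max` is not specified here.

```python
def double_dqn_target(reward, online_next_q, target_next_q, gamma=0.99):
    """Standard double DQN target for a single transition.

    `online_next_q` and `target_next_q` are the next-state Q-values of all
    actions under the online and target networks (hypothetical inputs).
    """
    # Action selection uses the online network ...
    best_action = max(range(len(online_next_q)), key=online_next_q.__getitem__)
    # ... while its value is read from the target network.
    return reward + gamma * target_next_q[best_action]
```

Decoupling selection from evaluation is what reduces the overestimation bias of the plain max-based target.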

### Evaluation
You can evaluate the models using the following command:
```
python experiment.py \
    --source random \
    --batch_size 200 \
    --n_embd 256 \
    --model_path <model_path> \
    --save_path <your save path> \
    --num_trajectories 1000 \
    --gpu 0 \
    --test_horizon 200 \
    --n_layer 8 \
    --Q \
    --greedy \
    --horizon 200
```

## Darkroom experiments
### Data Generation
You should first generate data using the following command:
```
python generate_data.py --env darkroom  --dim 10 --mix 0.0
```
Arguments:
- dim: the dimension of the darkroom
- mix: the mix ratio of the expert rollin

The data will be saved in the `data` folder.

### Training models
You can train models using the following command:
```
python pretrain.py \
    --env darkroom \
    --dim 10 \
    --mix 0.0 \
    --soft_max \
    --Q
```
### Evaluation
You can evaluate the models using the following command:
```
python eval.py --env darkroom --H 100 --mix 0.0 --soft_max --Q --stored_epoch <stored_epoch>
```

## Math Reasoning
We use `Qwen/Qwen2.5-1.5B-Instruct` as the base model and implement Behavior Cloning and QTPO.

### Data Generation
You should first generate the offline dataset.
See `Math/test_shepherd_new.ipynb` and `Math/split_shepherd.ipynb` for details.

### Behavior Cloning
You can train the model using the following command:
```
python Math/sft_instruct.py \
    --base_model Qwen/Qwen2.5-1.5B-Instruct \
    --data_path <your data path> \
    --seed 42 \
    --batch_size 4 \
    --lr 5e-7 \
    --num_epochs 1 \
    --r 8 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --target_modules q_proj k_proj v_proj o_proj \
    --modules_to_save wte lm_head \
    --gradient_accumulation_steps 32 \
    --use_scheduler \
    --store_model
```
Arguments:
- base_model: the base model to use
- data_path: the path to the stored data
- seed: the seed to use
- batch_size: the batch size to use
- lr: the learning rate to use
- num_epochs: the number of epochs to use
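- r: the LoRA rank
- lora_alpha: the LoRA scaling factor
- lora_dropout: the dropout probability applied to the LoRA layers
- target_modules: the modules to which LoRA adapters are applied
- modules_to_save: the modules, apart from the LoRA layers, that are trained and saved in full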
- gradient_accumulation_steps: the number of gradient accumulation steps
- use_scheduler: whether to use a learning rate scheduler
- store_model: whether to store the model
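
Two quantities implied by the command above are easy to misread, so here is a small worked example. The helper names are hypothetical; the arithmetic assumes a single device and the conventional LoRA scaling of `lora_alpha / r`, which may differ if the script configures LoRA otherwise.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return batch_size * gradient_accumulation_steps * num_devices

def lora_scale(lora_alpha, r):
    """Conventional LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r

# Values taken from the command above (single device assumed).
print(effective_batch_size(batch_size=4, gradient_accumulation_steps=32))  # 128
print(lora_scale(lora_alpha=32, r=8))  # 4.0
```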

### QTPO
You can run QTPO after obtaining the behavior-cloned model:
```
python bppo.py \
    --base_model_name Qwen/Qwen2.5-1.5B-Instruct \
    --data_path <your data path> \
    --adapter_path <your behavior cloned model path> \
    --seed 42 \
    --batch_size 4 \
    --lr 2e-7 \
    --clip_ratio 0.1 \
    --num_iter 5 \
    --num_batchs_per_iter 50 \
    --gradient_accumulation_steps 2
```
Arguments:
- base_model_name: the base model to use
- data_path: the path to the stored data
- adapter_path: the path to the behavior cloned model
- seed: the seed to use
- batch_size: the batch size to use
- lr: the learning rate to use
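
The `clip_ratio` flag is not described in the list above. If it plays the same role as the clipping parameter in PPO-style objectives (an assumption this README does not confirm), its effect on a single sample can be sketched as follows. The function name and inputs are hypothetical and this is not the code in `bppo.py`.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_ratio=0.1):
    """PPO-style clipped surrogate objective for one sample (to be maximized)."""
    # Probability ratio between the updated policy and the reference policy.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - clip_ratio, 1 + clip_ratio].
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    # Taking the minimum removes any incentive to move the ratio
    # outside the clipping interval.
    return min(ratio * advantage, clipped * advantage)
```

Under this reading, `--clip_ratio 0.1` would keep each update within roughly 10% of the reference policy's action probabilities.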
